► ABSTRACT

Raw sequence data representing the majority of a bacterial genome can be obtained at a tiny
fraction of the cost of a completed sequence. To demonstrate the utility of such a resource, 870
single-stranded M13 clones were sequenced from a shotgun library of the Salmonella typhi Ty2 genome.
The sequence reads averaged over 400 bases and sampled the genome with an average spacing of once
every 5,000 bases. A total of 339,243 bases of unique sequence was generated (approximately 7%
representation). The sample of 870 sequences was compared to the complete Escherichia coli K-12
genome and to the rest of the GenBank database, which can also be considered a collection of sampled
sequences. Despite the incomplete S. typhi data set, interesting categories could easily be
discerned. Sixteen percent of the sequences determined from S. typhi had close homologs among known
Salmonella sequences (P < 1e-[illegible] in BlastX or BlastN), reflecting the proportion of these
genomes that have been sequenced previously; 277 sequences (32%) had no apparent orthologs in the
complete E. coli K-12 genome (P > 1e-[illegible]), of which 155 sequences (18%) had no close
similarities to any sequence in the database (P > 1e-[illegible]). Eight of the 277 sequences had
similarities to genes in other strains of E. coli or plasmids, and six sequences showed evidence of
novel phage lysogens or sequence remnants of phage integrations, including a member of the lambda
family (P < 1e-[illegible]). Twenty-three sample sequences had a significantly closer similarity to
a sequence in the database from organisms other than the E. coli/Salmonella clade (which includes
Shigella and Citrobacter). These sequences are new candidate lateral transfer events to the S. typhi
lineage or deletions on the E. coli K-12 lineage. Eleven putative junctions of insertion/deletion
events greater than 100 bp were observed in the sample, indicating that well over 150 such events
may distinguish S. typhi from E. coli K-12. The need for automatic methods to more effectively
exploit sample sequences is discussed.

► INTRODUCTION 

The complete sequencing of bacterial genomes has revolutionized microbiology. However, the current
high cost of completely sequencing genomes has limited its application to important pathogens and
commercially important bacteria. The majority of this cost is incurred because of the
labor-intensive methods which must still be used to close gaps covering the last few percent of the
genome and to reduce the error rate to below [value missing].

In contrast, a partial sequence of a bacterial genome can be obtained at low cost (39). Our costs of
sequencing indicate that a random sample of sequences equivalent to the size of a genome (1×
coverage) can be obtained at 1 to 2% of the cost of complete sequencing of the genome. Such a 1×
"sample sequence" captures approximately 63% of the genome in at least one strand, the average
contig size is about 690 bases, and the average gap size about 400 bases (using equations in
reference 30). The number of clones required for sample sequencing of a 1× genome equivalent is
directly related to genome size. For example, the genome of Salmonella typhi, which is 4.78 Mbp
(14), would require about 12,000 reads of 400 bases. Similarly, a 2× sample sequence costs about 2
to 4% of a completed sequence and represents about 86% of the genome. The average contig size is
about 1,280 bases, and the average gap size is 200 bases. In a bacterium, this level of sampling
would ensure that almost every cistron was represented among the sample sequences.
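
The coverage figures quoted above can be reproduced with a short calculation. The sketch below is illustrative only: it assumes that reads land uniformly at random on the genome (a Poisson model in the spirit of the equations cited in the text) and ignores any minimum overlap needed to join two reads. The function names are hypothetical and are not taken from the original analysis.

```python
import math


def covered_fraction(coverage):
    """Expected fraction of the genome hit by at least one read."""
    return 1.0 - math.exp(-coverage)


def reads_needed(genome_size, read_length, coverage):
    """Number of reads giving the requested fold coverage."""
    return math.ceil(coverage * genome_size / read_length)


def mean_gap(read_length, coverage):
    """Expected gap length between contigs (no minimum overlap assumed)."""
    return read_length / coverage


def mean_contig(read_length, coverage):
    """Expected contig length (no minimum overlap assumed)."""
    return read_length * (math.exp(coverage) - 1.0) / coverage


if __name__ == "__main__":
    for c in (1, 2):
        print(c, covered_fraction(c), mean_contig(400, c), mean_gap(400, c))
    print(reads_needed(4_780_000, 400, 1))
```

Under these assumptions, 1× coverage with 400-base reads gives roughly 63% of the genome covered, mean contigs near 690 bases, and mean gaps of 400 bases, while 2× coverage gives roughly 86%, 1,280 bases, and 200 bases, in line with the values stated in the text.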

The low cost of partial coverage of genomes makes it possible to consider sample sequencing of
multiple genomes within a species, genus, or family. When a completely sequenced genome and a
closely related sample-sequenced genome are compared, it is possible to identify sequences in the
sampled genome that are absent in the completely sequenced genome. In bacteria, evolutionary
mechanisms include the lateral transfer of cistrons and other units many kilobases in length,
sometimes from distant species or phage. Thus, the presence of entire cistrons in one genome that
are absent in a related genome is a quite common occurrence in bacteria, and these differences often
contribute to the differences in life strategies of related species (18). If multiple loci are
available from multiple related species, then it is also possible to identify some of the loci that
appear to have a phylogeny different from that of the rest of the genome. These are potential
lateral transfers of genes or cistrons to the lineage of one genome or deletion events in the
completely sequenced genome that have occurred since they diverged from their common ancestor. The
vast GenBank database can be considered a huge collection of sample sequences for these purposes.

Here, we chose S. typhi for a pilot sample-sequencing effort because its genome is closely related
to a completely sequenced genome, namely, that of Escherichia coli K-12 (6), and because it is
closely related to the partially completed sequence of Salmonella typhimurium (45). The majority of
the S. typhi and E. coli genomes are probably related by descent from their common ancestor, and
these regions share an average of about 85% identity at the nucleotide level and are even more
conserved at the amino acid level (50). Previous studies used discrepancies in the alignments of the
maps of Salmonella and E. coli or DNA-DNA hybridization between these genomes to estimate that
anywhere from 20 to 50% of these genomes may not be related by descent from their common ancestor
(11, 20, 42). Indeed, even within the Salmonella enterica group (which includes S. typhi), up to 20%
of the genome has been estimated to consist of genes that are not shared between pairs of strains
(20).

S. typhi is of particular interest because it causes typhoid fever, a severe and sometimes fatal
disease in humans. The only known effective host is humans, and traditional methods for studying
virulence mutations in model hosts are not adequate. Thus, sample sequencing of S. typhi could be
particularly illuminating because it can identify candidates for genes involved in virulence.

► MATERIALS AND METHODS

Cloning and sequencing. Five micrograms of genomic DNA was sonicated, end repaired, fractionated,
and subcloned in M13 as described previously (56); 1,059 subclones were purified and sequenced by a
fluorescence-based sequencing method. Sequencing used standard shotgun library production, automatic
plaque picking and DNA preparation, and short reads on an ABI 377 DNA sequencer. The cost was
estimated at $1.80 per sequence read ($1.10 for supplies and $0.70 for labor), yielding a total cost
of $2,000 for the sequence production. The success rate (percentage of subclones that provided
high-quality sequence) for this library was 82.1%. The resulting raw sequence reads were processed
by the program Automated Sequence Processor to remove low-quality traces and the X-Windows version
of the Genome Assembly Program to assemble any overlapping reads. The 6% redundancy observed for the
library is expected at this level of sampling (approximately 0.07-fold) (30). These shotgun
sequencing methods are more fully described elsewhere (56).

Comparison to the GenBank database. At present there is a lack of good tools to pick out the most
interesting from among a large number of sample sequences by using comparison with completed
genomes and with other sequences in the database. Thus, we adapted data from the most readily
available tools, the Blast suite of programs (reference 3 and references therein).

Each sample sequence was compared to the complete genome sequence of E. coli K-12 (6) by using
BlastN 1.0 and TBlastX 1.0 and to the E. coli K-12 open reading frame (ORF) sequences by using
BlastX 2.0. In addition, each sequence was compared to the entire GenBank nucleotide and amino acid
databases with BlastN and BlastX, respectively, using the Blast server at the Genome Sequencing
Center, Washington University, St. Louis, Mo. These data are available at

http://genome.wustl.edu/gsc/bacterial/Salmonella.html. The data were further processed to show only
the most significant match for each sequence, using Microsoft Word 6.0 with the assistance of
macros. In some cases, matches with previously sequenced Salmonella sequences were removed first.
The best hits for each search were entered into an Excel spreadsheet.
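
The step of keeping only the most significant match for each sequence can be sketched programmatically. The authors describe doing this with word-processor macros and a spreadsheet; the fragment below is only an illustrative alternative. It assumes the hits have already been parsed into (query, subject, P value) tuples, and the subject labels and P values in the usage note are invented placeholders.

```python
def best_hits(rows):
    """Keep the most significant (smallest P) hit for each query sequence.

    rows: iterable of (query_id, subject_id, p_value) tuples.
    Returns a dict mapping query_id -> (subject_id, p_value).
    """
    best = {}
    for query, subject, p_value in rows:
        if query not in best or p_value < best[query][1]:
            best[query] = (subject, p_value)
    return best
```

For example, `best_hits([("hb59h10.s1", "subject_A", 1e-8), ("hb59h10.s1", "subject_B", 1e-30)])` keeps only the `subject_B` entry for that clone.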
Significance thresholds for putative orthologs in E. coli K-12 and putative paralogous
comparisons. The 870 sampled sequences were ranked by the significance score of their match with the
E. coli genome. Putative orthologs were defined empirically as those matches that achieved
P < 1e-[illegible] using either BlastN, BlastX, or TBlastX. A similar process was used to determine
the number of orthologs with Salmonella sequences in the database. The threshold chosen was based on
the fact that with sequence reads of 200 to 500 bases, this significance score always translated to
a homology of greater than 60% nucleotide or amino acid identity spanning 200 or more bases, which
is within the range expected for orthologous comparisons (50).

Alignments that yielded scores of P > 1e-[illegible] in all the three Blast search methods generally
represented less than 60% nucleotide identity over a span of about 200 bases or less than
[illegible] amino acid identity in a span of 60 amino acids (or a lower similarity in a longer
alignment). When scores of P > 1e-[illegible] occurred in an S. typhi versus E. coli K-12
comparison, these were classified as putative paralogous comparisons.
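
The threshold rule described above amounts to a simple classification on the best score across the three Blast programs. The following sketch is a hypothetical restatement, not the authors' procedure as implemented. The cutoff is passed in as a parameter because its numeric value is not legible in this copy of the text, and the value used in the usage note is an arbitrary placeholder.

```python
def classify_ecoli_match(p_by_program, ortholog_cutoff):
    """Classify a sample sequence by its best match to E. coli K-12.

    p_by_program: dict such as {"blastn": P, "blastx": P, "tblastx": P}.
    ortholog_cutoff: significance threshold (assumed; supply your own value).
    A match below the cutoff in any one program counts as a putative ortholog;
    otherwise all programs are above it and the match is treated as paralogous.
    """
    best_p = min(p_by_program.values())
    if best_p < ortholog_cutoff:
        return "putative ortholog"
    return "putative paralogous comparison"
```

With a placeholder cutoff of 1e-20, a read scoring 1e-35 in BlastN is called a putative ortholog even if its BlastX score is weak, whereas a read scoring above the cutoff in all three programs is not.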

Detecting potential lateral transfer and deletion events. For each of the 870 sample sequences, the
ratios of the most significant score in E. coli K-12 and the most significant score in the rest of
the GenBank database (other than Salmonella) were determined and ranked. These ratios were
calculated from both BlastN and BlastX scores. The best examples of potential lateral transfer or
deletion events were identified by first considering only those sample sequences (i) that had a
significance of match of P < 1e-[illegible] in either BlastN or BlastX with an organism other than
E. coli K-12 (or Salmonella) and (ii) where the match with this other organism had at least a
10,000-fold greater significance score than the best match in E. coli K-12, using both BlastN and
BlastX. Amino acid similarities in the text are reported as single ratios that accumulate all
nonoverlapping patches of similarity detected by the BlastX program.
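
The two selection criteria above can be written as a small predicate. This is a minimal sketch under stated assumptions: the best P values for each program are assumed to be available as plain numbers, the significance cutoff for criterion (i) is a parameter because its value is not legible here, and the function and argument names are hypothetical.

```python
def is_transfer_candidate(ecoli_p, other_p, p_cutoff, min_fold=1e4):
    """Apply criteria (i) and (ii) to one sample sequence.

    ecoli_p: best P values against E. coli K-12, e.g. {"blastn": P, "blastx": P}.
    other_p: best P values against any other (non-Salmonella) organism.
    p_cutoff: significance required of the other-organism match (assumed value).
    min_fold: required fold difference in significance (10,000 in the text).
    """
    programs = ("blastn", "blastx")
    # (i) significant match to another organism in either program
    significant = any(other_p[m] < p_cutoff for m in programs)
    # (ii) other-organism match at least min_fold more significant in both programs
    # (written as a product to avoid dividing by a P value of zero)
    much_better = all(
        ecoli_p[m] > other_p[m] and ecoli_p[m] >= min_fold * other_p[m]
        for m in programs
    )
    return significant and much_better
```

Requiring criterion (ii) in both programs while allowing criterion (i) in either one follows the wording of the text; a stricter or looser reading would only change `any` and `all`.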

► RESULTS AND DISCUSSION 

As a preliminary demonstration of the utility of bacterial sample sequence resources, we sequenced
1,059 M13 clones from S. typhi Ty2. There were 870 reads of acceptable quality. The average read
length was over 400 bases. These 870 clones melded into 791 contigs of 339,243 bp, representing
about 7% of the genome.

The sequences were searched against the entire GenBank database, including the completed E. coli
K-12 genome, using BlastX and BlastN. We found a continuum of similarities, ranging from a high
degree of homology to no significant similarity, reflecting different evolutionary origins or
different rates of divergence of the sequences.

Of the 870 sample sequences, a total of 135 (16%) had a presumed ortholog among sequences in various
Salmonella serovars that were already in the database (P < 1e-[illegible]) (Table 1). Sequences with
these Blast scores reflect the cumulative proportion of genomes from various Salmonella serovars
that had previously been sequenced in targeted projects.
TABLE 1. Similarities of S. typhi sample sequences to sequences in the public databases

There were 411 sequences (47%) that had highly significant homologies with the complete sequence of
the E. coli K-12 genome (P < 1e-[illegible]). These are presumably orthologs that diverged from a
common ancestor of E. coli and S. typhi, although it is also possible that a few of these sequences
are lateral transfer events between these two lineages since their divergence. The latter events
would be characterized by exceptionally high conservation of DNA sequence compared to that found for
typical orthologs, which are about 15% divergent in DNA sequence.

A total of 593 (68%) of the sample sequences had homologies with E. coli K-12 that were more
significant than P = 1e-[illegible] (Table 1). Thus, 277 sequences (32%) had a less significant
homology with E. coli K-12. At a threshold of P > 1e-[illegible], the match between sequences is
sufficiently poor that they may not reflect true orthologs (by descent) between Salmonella and
E. coli even after considering random fixation of mutations and errors in the sample sequences.
Thus, 32% is perhaps an underestimate of the total amount of sequences in these genomes that are not
homologous by descent from a common ancestor.
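
As a consistency check on the category sizes reported in this section, the percentages can be recomputed from the counts. The sketch below uses only numbers stated in the text (870 reads in total; 135, 411, 593, and 155 reads in the respective categories); the variable names are illustrative.

```python
TOTAL_READS = 870

# Counts reported in the text
with_salmonella_ortholog = 135
highly_significant_ecoli = 411
any_significant_ecoli = 593
no_database_match = 155


def percent(count, total=TOTAL_READS):
    """Percentage of the sample, rounded to the nearest whole number."""
    return round(100 * count / total)


# Reads without a significant E. coli K-12 match are the remainder
no_ecoli_ortholog = TOTAL_READS - any_significant_ecoli

if __name__ == "__main__":
    for label, n in [
        ("Salmonella ortholog", with_salmonella_ortholog),
        ("highly significant E. coli match", highly_significant_ecoli),
        ("significant E. coli match", any_significant_ecoli),
        ("no E. coli ortholog", no_ecoli_ortholog),
        ("no database match", no_database_match),
    ]:
        print(label, n, percent(n))
```

The remainder works out to 870 − 593 = 277 reads, which is the count that corresponds to the stated 32%.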

Most or all of these 277 sample sequences are presumably from "loops" in the S. typhi genome that
distinguish this genome from the E. coli K-12 genome. Some of these 277 sequences may have been
acquired by S. typhi since the divergence from the common ancestor with E. coli, while others may
have been preserved in at least part of the Salmonella lineage but deleted in at least the K-12 part
of the E. coli lineage. Among these 277 sample sequences, there are at least 10 examples that
matched sequences in known loops from Salmonella that had already been characterized by other
researchers. For example, hb59d06.s1 and hb5^)dlf),sl are almost identical to parts of rfbG
(CDP-glucose 4,6-dehydratase gene) from the O-antigen cluster of S. typhimurium (22). hb59h10.s1 and
hb60e06.s1 are almost identical to the ssaR gene from the type III secretion system apparatus of
S. typhimurium. This is part of pathogenicity island 2 and is not present in E. coli K-12 or
Salmonella bongori, which diverged at the first branch point in the Salmonella lineage.
Pathogenicity island 2 was probably acquired after this divergence by horizontal transfer from an
unknown source (19).

One interesting subset of this class of nonorthologs is the sequences that have no apparent homologs
in the entire database. There were 155 sequences (18%) that had similarities less significant than
1e-[illegible], a level at which the significance of any alignments is unreliable. These entirely
novel sequences of no known function which occur in S. typhi but not E. coli K-12 presumably include
some genes encoding novel functions.

Three-way comparisons. Pairwise comparisons of Blast significance scores are far from a foolproof
strategy to detect potential lateral transfer and deletion events. Although most known genes in the
Salmonella and E. coli genomes are closely homologous and these shared genes average about 85%
identity at the nucleotide level, the random fixation of mutations means that genes that are related
by descent from the common ancestor vary widely around the mean of 85% identity. Lateral transfer is
an ongoing process in these species and can occur between E. coli and Salmonella, between one of
these species and other closely related genomes (such as Citrobacter and Shigella), or between one
of these species and a more distantly related genome. Thus, similarities between paralogous genes
(including those due to lateral transfer) lie on a continuum that overlaps the similarities between
genes that are related by descent from the common ancestor. As a consequence, estimates of the level
of lateral transfer by using comparisons between sample sequences from S. typhi and the complete
E. coli K-12 genome (or, in general, between any two genomes) are inherently unreliable. Another
serious limitation of using Blast scores for the whole read length of each sample sequence to rank
sample sequences is that the sample sequences have different read lengths and the significance
scores are sensitive to the length of the homology detected. In the analyses discussed above, we
have tried to avoid these problems by adjusting the significance thresholds to reflect these facts.
However, these limitations can be more effectively mitigated by using a phylogenetic comparison with
a third species.

The key to identifying novel paralogous comparisons is to have a third reference sequence from an
outgroup species. In most cases, closely related sequences in two ingroup species will be more
similar to each other than either is to any sequence in the outgroup species. However, if this is
not true, then a potential lateral transfer or deletion event is revealed. It might be argued that
the sample sequences are short and contain occasional errors, and so this might be an unreliable
strategy. However, it should be noted that insertion/deletion errors in the sample sequence will be
"private" (i.e., uninformative), and accidental matches of miscalled bases will occur with
approximately equal frequency in each true homolog. Both types of error will not typically bias a
match to one homolog in the database versus another. The best matches detected will typically
reflect the closest similarities that would be seen if the sample sequence were error free, although
the apparent genetic distance of the sampled sequence may be exaggerated by sequencing errors.
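
The three-way logic can be reduced to a single comparison: a sample sequence is flagged when its best match outside the ingroup is closer than its best match inside it. The sketch below is a simplified illustration rather than a phylogenetic method. It compares fractional identities directly, and the `margin` parameter (a hypothetical safeguard against calling near-ties) and all example values are assumptions.

```python
def outgroup_closer(ingroup_identity, outgroup_identity, margin=0.05):
    """Flag a possible lateral transfer or deletion event.

    ingroup_identity: fractional identity of the sample sequence to its best
        match in the ingroup genome (for example E. coli K-12).
    outgroup_identity: fractional identity to its best match in an outgroup.
    margin: minimum excess required before the outgroup is called closer.
    """
    return outgroup_identity - ingroup_identity > margin
```

A sequence that is 84% similar to an outgroup protein but only 55% similar to the nearest ingroup protein would be flagged, whereas the usual case of a closer ingroup match would not.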

The best examples of potential lateral transfer or deletion events are discussed below. These
examples were identified by stringent criteria in which the Blast score in E. coli was much less
significant than the Blast score for some other sequence in the GenBank database (see Materials and
Methods). The criteria undoubtedly removed some legitimate examples of potential lateral transfer
events (or examples of deleted sequences in the K-12 lineage where the best score would be a
paralogous comparison). Nevertheless, these criteria concentrated the search toward the
best-supported examples.

Only new relationships that could not be deduced previously from the sequences already in the
databases are discussed below and presented in Table 2. Thus, those sample sequences that were
homologous to known Salmonella sequences were dropped from consideration.

TABLE 2. Comparison of S. typhi sample sequences with the public databases
Sequences found in some E. coli strains but not found in strain K-12. Using the above criteria, we
found eight S. typhi sample sequences that had better matches with sequences from E. coli strains
other than E. coli K-12 and which did not occur in known Salmonella sequences (Table 1). Clones
hb53h05.s1, hb5()g12.s1, and hb37e10.s1 are similar to three enzymes in an aromatic degradative
pathway of some E. coli strains where the cluster of genes occurs as an insertion relative to
E. coli K-12 (41, 42, 44). This is presumably an example of how some E. coli strains, and apparently
at least this one Salmonella strain, have become adapted to a new nutritional source by the
recruitment of a catabolic cassette.

The strain of S. typhi that we used has no known plasmids. Nevertheless, there were four sample
sequences that have their closest similarities to genes found on plasmids in E. coli. hb56b07.s1 has
similarity to the immunity protein of the ColE7 plasmid ([illegible]) which is found in a few
strains of E. coli (56/74 [76%] similarity to this 84-amino-acid peptide). hb57h01.s2 has similarity
to the transfer operon gene, tJxiF, of the conjugative F plasmid found in some E. coli strains (57).
hb54b06.s1 has a patch of moderate similarity (47/71 [66%]) to a plasmid-associated chaperone gene
of enteroaggregative E. coli (47). hb62d06.s1 has an ORF similar to ipgD of Shigella sonnei
(cumulative 87/137 [64%] similarity), the gene for a secreted protein on a virulence plasmid
proximal to [illegible] (2). hb55b09.s1 is highly similar to the transposon- and plasmid-borne
citrate utilization gene, citB, found in Klebsiella spp. and in some E. coli strains other than K-12
(21). Another clone, hb53b07.s1, may be more closely related to citB in Klebsiella ([illegible]
similarity in 131 amino acids) than E. coli (73% similarity in 141 amino acids). Tricarboxylic acids
are used in many Salmonella serovars but not in E. coli; citrate is used as a carbon source. The tct
operon at 60 min is sequenced in Salmonella. The tct[illegible] genes and citB and citA map at
17 min but are not sequenced in S. typhimurium LT2; thus, these genes probably are present in
S. typhimurium and in Klebsiella but missing from E. coli.

Probably the sequences noted above are integrated in the S. typhi genome rather than being on a
previously unknown plasmid. Some of these sequences presumably represent further examples of
sequences that are found on plasmids in one bacterium but in the genomes of other bacteria.
Sequences recruited to the genome by integration of plasmids are probably a major source of the
loops that distinguish bacterial genomes.

Sequences similar to phage. The S. typhi Ty2 genome does not have any previously known integrated
prophage. Nevertheless, limited sequence similarity to various bacteriophages or retrons was found
in this genome. hb58g10.s1 has some similarity to a retron-associated sequence (32) and to a
bacteriophage P2 putative vertex protein (44/55 [80%] similarity) ([illegible]). This is the first
example of a sequence that may be associated with a retron in Salmonella. hb55g10.s1 has some
similarity to an E. coli retron methylase (92/144 [64%] similarity). It is possible that this is
another retron-associated sequence in S. typhi. It is related to dam from Serratia marcescens
(cumulative [illegible]/124 [60%] similarity), S. typhimurium (81/123 [66%] similarity), and E. coli
([illegible]/123 [59%] similarity) but is less related to these proteins than they are to each other
(>80% similarity over 200 amino acids). The other end was selected for sequencing and has a patch of
DNA sequence 143/149 (95%) identical to a sequence in the same E. coli retron and a conceptual
translation which is 35/43 (81%) identical to phage P2 terminase ATPase subunit.
hb53c05.s1 has some similarity to bacteriophage P2 in the proposed endonuclease subunit of terminase
(71/113 [63%] similarity) ([illegible]). hh>:^d()N.^l is related to the gam gene of bacteriophage Mu
([illegible] identities). The gam gene encodes a protein which protects linear double-stranded DNA
from exonuclease degradation in vitro and in vivo ([illegible]). hb()()b03.s1 has similarity to a
coliphage 186 putative tail fiber assembly protein (61/86 [71%] amino acid similarity)
([illegible]). Finally, hb^Tu l().s2 is about 60% identical to a tail protein of bacteriophage
lambda at the DNA level and [illegible] identical at the protein level.

Despite these similarities to phage, it should be noted that the typical lysogenic phage is many
tens of kilobases in length, and so any complete prophage in the genome should yield a number of
sequences from a sample of the size that we used (one clone every 5 kb). Thus, complete genomes from
close homologs of known bacteriophages are unlikely to occur in the Ty2 genome. The sequences
observed may be remnant parts of an ancient prophage that the genome has preserved for its own
needs.
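
The argument that a complete prophage would have been sampled several times can be made quantitative. The sketch below assumes that read start points are scattered uniformly at random with an average spacing of one read per 5 kb, so that the number of reads falling in a region follows a Poisson distribution; the 40-kb prophage length in the usage note is a hypothetical example, not a value from the text.

```python
import math


def expected_reads(region_length, spacing=5000):
    """Mean number of sample reads expected to fall within a region."""
    return region_length / spacing


def prob_missed(region_length, spacing=5000):
    """Poisson probability that no read at all falls within the region."""
    return math.exp(-expected_reads(region_length, spacing))
```

For a hypothetical 40-kb prophage, `expected_reads(40_000)` gives 8 reads on average, and the chance of sampling none of it, `prob_missed(40_000)`, is well below 0.1%.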

Sequences similar to other enterobacteriaceae. Another set of sample sequences have their closest
similarities in the database to sequences that have been characterized in enterobacteria outside the
E. coli/Shigella/Salmonella/Citrobacter clade. Such sequences are of particular interest because
they may represent deletion events in the E. coli K-12 lineage or insertion events in the S. typhi
lineage so that the gene in Salmonella shares an ancestor with an organism other than E. coli.
Either way, their phylogeny may not be the same as the phylogeny of the majority of the genome
shared by Salmonella and Escherichia. The similarities in the database can give clues as to the
function of these sequences in S. typhi, a function that may not have an exact counterpart in the
E. coli K-12 genome.

Among this class of sample sequences are a number that are most closely similar to genes in the
close sibling genus, Klebsiella. hb57b09.s1 and hb91f10.s1 contain ORFs with patches of close
similarity to citA (8), the sensor kinase, in Klebsiella (113/134 and 153/172 [84 and 89%]
similarity, respectively) and much less similarity to the E. coli K-12 sensor kinase gene citA
(51/93 and 111/162 [55 and 69%] similarity, respectively).

hb53d08.s1 and hb58d()].s1 contain different portions of an ORF that is closely related to the
arylsulfate sulfotransferase of the phylogenetically very distant bacterium Campylobacter jejuni
(108/135 and 132/152 [80 and 87%] amino acid similarity, respectively) (59). These ORFs are less
related to the Klebsiella protein in the same region (<70% similarity) (4). hb57h08.s2 contains
sequences that have similarity to the arylsulfate sulfotransferase protein of Klebsiella (66/95
[69%] similarity). hb58b02.s1 is almost identical to the disulfide isomerase of Klebsiella in the
same arylsulfate metabolic complex.

To obtain further supporting evidence for a paralogous comparison, as part of our effort to sequence
the entire genome of S. typhimurium, the sequencing of the region corresponding to the arylsulfatase
in E. coli K-12 is in progress (unpublished data). This gene and an adjacent regulator protein are
absent in the corresponding part of the S. typhimurium genome, further supporting the possibility
that the Salmonella and E. coli genes are different in phylogenetic history and location in the
genome.

hb58e07.s1 and hb62d05.s1 have similarity to RfbD, a putative protein of unknown function (36/38 and
63/72 [95 and 88%] similarity, respectively) in the [illegible]-kb rfb gene cluster from Klebsiella
pneumoniae serotype O1 (rfbKpO1) ([illegible]). This cluster contains six genes whose products are
required for the biosynthesis of a lipopolysaccharide O antigen ([illegible]). Iil^()2!()4.s 1 is
closely related to another gene in the cluster, rfbF, encoding the galactosyltransferase protein of
K. pneumoniae (133/156 [85%] similarity). This cluster would be expected to be missing from E. coli
K-12, and many other enteric organisms, which do not put rhamnose into their lipopolysaccharide. An
rfb cluster has been cloned from S. typhimurium (22), but the Klebsiella rfbF and S. typhi sequences
are not related to this cluster.

Some rather unexpected similarities occur between S. typhi sample sequences and sequences in
enterobacteria or related proteobacteria. For example, hb55f03.s1 is remarkable in that it shares
very significant similarity with indolepyruvate decarboxylase from Enterobacter cloacae, another
enterobacterium (164/217 at the DNA level; cumulative patches of similarity of 111/124 [90%] at the
amino acid level). This enzyme is used to convert indole-3-pyruvic acid to indole-3-acetic acid, a
well-known plant hormone. Other enterobacteria that have this gene are Enterobacter agglomerans
strains, Pantoea agglomerans, Klebsiella aerogenes, and Klebsiella oxytoca (61), some of which are
opportunistic pathogens of humans.

hb91c02.s1 is related to the high-affinity outer membrane ferrioxamine receptor foxA of Yersinia
enterocolitica (63/105 [60%] similarity) (5) and has slightly less significant similarity with the
ferrichrome-iron receptor precursor of E. coli (57/[illegible] similarity). Either this is a
paralogous comparison between E. coli and S. typhi or the gene has been under strong selective
pressure to diverge quickly in one or both of these species.

Sequences similar to distantly related organisms. hb59h11.s1 contains an ORF related to an accessory
colonization factor of Vibrio cholerae (similarity, 78/118 [66%]) which is probably related to the
methyl-accepting chemotaxis proteins (16). hb56tu6.sl is closely related to the histidine
ammonia-lyase (hutH) gene of Pseudomonas putida (patches totaling 87/117 [74%] amino acid
similarity) (15). hb56h08.s1 is similar to a mandelate racemase of P. putida (77/139 [55%]
similarity) ([illegible]).

Three S. typhi sample sequences have their closest similarity in the database with species of the
phylogenetically very distant bacterium Haemophilus. hb56b09.s1 shares some similarity with the
Haemophilus influenzae transport ATPase protein cyd[illegible] (51/84 [61%] similarity) and a weaker
similarity with the cydC from E. coli (42/69 [61%] similarity). This latter protein is an ABC
(ATP-binding cassette) family membrane transporter necessary for the formation of the cytochrome bd
quinol oxidase (40). hb55h11.s1 has weak similarities to a leukotoxin secretion ATP-binding protein
of Haemophilus actinomycetemcomitans (42/88 [48%] similarity) (28) and a much weaker similarity to
an E. coli hemolysin secretion ATP-binding protein not found in the K-12 strain. These are part of
cytolytic toxin complexes. hb58f11.s1 is similar to a Haemophilus hypothetical protein of unknown
function (cumulative similarities, 82/127 [65%]).

hb53f11.s1 displays a remarkable similarity to the chitinase proteins of a number of
phylogenetically very distant bacteria. The similarity to the chitinase of Aeromonas caviae
([illegible]) is 82/147 (56%) extending over almost the whole protein. It is hard to imagine what
the purpose of the related gene might be in S. typhi. Perhaps the gene in S. typhi has a different
substrate. The other end of this clone was selected for sequencing and proved to be homologous to
[illegible].

The closest known similarities for a number of other sample sequences occur in other phylogenetically
very distant bacteria. hb57b04.s1 is related to a protein in the trpA region of Buchnera aphidicola
(88/130 [68%] similarity) (27). The function of that protein is not known, but it is related to the valA
(E. coli) and ycx[...] (Bacillus subtilis) genes, which are thought to encode integral membrane proteins.
hb62g0[...].s1 shows some similarity to the B. subtilis choline transport system ATPase (opu[...]) and proV
(60/78 [77%] similarity) ( ) but little similarity to genes in E. coli or Salmonella, though there is weak
similarity to a hypothetical ABC transporter (yehX) in E. coli (51/75 [68%] similarity). hb02h0[...].s1 has
regions of similarity with an ORF involved in conjugal transfer in the oriT region of a streptococcal
plasmid (52/94 [55%] similarity) (52). Finally, hb56e11.s1 and hb55a12.s1 overlap and share similarity
to the voltage-dependent potassium channel alpha subunit from many eukaryotes. The best of these
similarities is to Caenorhabditis elegans, where patches of 60/92 (65%) similarity are observed. This
sequence is weakly homologous (33/94 [35%] similarity) to a homolog of eukaryotic potassium channel
proteins previously noted in E. coli (3[...]). The other end of this clone was sequenced and found to encode
cardiolipin synthetase, which maps at 28 min in E. coli K-12.
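
The similarity scores in this section are written as a ratio followed by a rounded percentage. As a
minimal sketch, assuming each ratio is the number of similar positions over the number of aligned
positions (the function name is hypothetical and is not taken from any alignment program), the
percentages can be recomputed in Python:

```python
def percent_similarity(similar_positions: int, aligned_positions: int) -> int:
    """Return similar/aligned as a percentage rounded to the nearest integer."""
    if aligned_positions <= 0:
        raise ValueError("aligned_positions must be positive")
    return round(100 * similar_positions / aligned_positions)


# Two of the ratios quoted in the paragraph above.
print(percent_similarity(88, 130))  # 68
print(percent_similarity(60, 78))   # 77
```

Recomputing the percentage from the ratio is a quick consistency check on any score transcribed from an
alignment report.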

Candidate insertion/deletion junctions. Some sample sequences will consist of junctions of
insertion/deletion events that distinguish S. typhi from E. coli K-12 or from other Salmonella strains. At
present, the tools to conveniently find candidate junction sequences are under construction ([...]a).
However, we noted seven examples by visual inspection of BlastN alignments of the S. typhi clones with
the E. coli K-12 genome (Table 3). Some of these insertion/deletion events may only be 100 bases in
length, but some may be the junctions of very large insertions in S. typhi.



TABLE 3. Putative insertion/deletion junction fragments


Four other clones in which the sequence read from one end had interesting homologies in the database
and poorer homology with E. coli K-12 were chosen for sequencing from the other end of the clone.
These were hb53f11, containing a chitinase homolog; hb55f03, containing an indolepyruvate
decarboxylase homolog; hb55g10, containing a retron-associated sequence; and hb56e11, containing a
voltage-dependent potassium channel alpha subunit homolog (Table 2). Surprisingly, in all four cases the
other end of the clone was closely homologous to a known E. coli sequence, indicating the location of
the junction between unique and shared sequences within the ca. 1.5-kb clone. If these unique regions in
S. typhi that are not present in E. coli were generally many kilobases in length, then clones that contained
unique sequences at one end would generally also contain unique sequences at the other end. The fact
that all four clones contained junctions between unique and shared DNA suggests that many of the genes
that distinguish E. coli and S. typhi may be found as single genes or in small groups of a few genes.
Thus, the cloning of a number of large pathogenicity islands, of 10 kb or more, containing
genes that distinguish Salmonella from E. coli may lead to an exaggerated impression of the average
number of genes in each insertion/deletion event between these species. Indeed, we found that 7% of the
genome contains at least 11 insertion/deletion junctions for unique sequences over 100 bp. By
extrapolation, there should be at least 157 such events that distinguish S. typhi from E. coli K-12.
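
The two quantitative steps in this argument can be sketched in Python. The sketch rests on assumptions
that are not stated in the text: a clone end is taken to fall uniformly at random within a unique region,
the clone length is taken as 1.5 kb, the two region lengths are hypothetical, and the sampled fraction is
the approximately 7% representation given in the abstract. The function names are illustrative only.

```python
def prob_other_end_unique(unique_region_bp: float, clone_bp: float) -> float:
    """Chance that a clone with one end inside a unique region of the given
    length also has its other end inside that region (uniform placement)."""
    if unique_region_bp <= 0 or clone_bp < 0:
        raise ValueError("lengths must be positive")
    return max(0.0, (unique_region_bp - clone_bp) / unique_region_bp)


def extrapolate_events(observed_junctions: int, fraction_sampled: float) -> float:
    """Scale a junction count seen in a sample up to the whole genome."""
    if not 0 < fraction_sampled <= 1:
        raise ValueError("fraction_sampled must be in (0, 1]")
    return observed_junctions / fraction_sampled


clone_bp = 1_500  # assumed clone length
for region_bp in (3_000, 20_000):  # hypothetical unique-region lengths
    p_unique = prob_other_end_unique(region_bp, clone_bp)
    # Probability that four independent clones all contain a junction instead.
    p_all_four_junctions = (1 - p_unique) ** 4
    print(region_bp, p_unique, p_all_four_junctions)

print(round(extrapolate_events(11, 0.07)))  # 157
```

Under these assumptions, four junctions out of four clones would be very unlikely if unique regions were
typically tens of kilobases long. The four clones were chosen for their database matches rather than at
random, so the calculation is only a rough guide.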

One clone appears to span a junction of a region that differs between S. typhimurium and S. typhi. One
portion of hb58f04.s1 encodes the S. typhimurium phosphoglycerate transport system activator (pgtA)
gene ([...]), whereas another portion shows [...] similarity to a tail spike protein from
bacteriophage P22 ([...]2).

To confirm such junctions, the portion that is apparently unique to S. typhi can be used as a probe in a
Southern blot of S. typhimurium or E. coli DNA to determine if it is absent in these genomes.
Alternatively, if the insertion in S. typhi is less than 10 kb, then PCR primers that are a few hundred
bases apart in S. typhimurium or E. coli should yield a much larger PCR product spanning the insertion
in S. typhi.
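
As an illustration of the PCR test, the expected product sizes can be written down directly. This is a
minimal sketch assuming a simple insertion with no loss or rearrangement of the flanking sequence; the
primer spacing and insertion length are hypothetical values, and the function name is illustrative.

```python
MAX_INSERTION_BP = 10_000  # the text suggests PCR only for insertions under 10 kb


def expected_pcr_products(primer_spacing_bp: int, insertion_bp: int) -> tuple[int, int]:
    """Return (product size without the insertion, product size with it)."""
    if primer_spacing_bp <= 0 or insertion_bp < 0:
        raise ValueError("sizes must be positive")
    if insertion_bp >= MAX_INSERTION_BP:
        raise ValueError("insertion too long for this test; a Southern blot is the alternative")
    return primer_spacing_bp, primer_spacing_bp + insertion_bp


# Hypothetical case: primers 300 bases apart in E. coli, 4-kb insertion in S. typhi.
print(expected_pcr_products(300, 4_000))  # (300, 4300)
```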

Improvements in the search strategy. Many examples of sample sequences that were more closely
related to sequences in the database other than that of E. coli K-12 were undoubtedly missed by the
methods used here. For example, while the comparison with E. coli K-12 is with a complete genome, so
the sample sequence will generally align over the maximum possible region of homology, the rest of the
database is fragmentary. Each time a Salmonella sequence overlaps only partly at one end of another
sequence in the database, the Blast significance will be lower even though the region of match is
excellent. This limitation would be circumvented by a program that could compare sample sequences to
each other and to the fragmentary data in the GenBank database by using only regions of similarity
shared by all three (or more) sequences rather than the pairwise comparisons used here.
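
The core step of such a program would be to find the region of the sample sequence that is covered by
every alignment being compared. The following Python sketch shows only that step, assuming each
alignment has already been reduced to a half-open interval in sample-sequence coordinates; the names
and the example coordinates are hypothetical.

```python
from typing import Optional

Interval = tuple[int, int]  # half-open [start, end) in sample-sequence coordinates


def shared_region(alignments: list[Interval]) -> Optional[Interval]:
    """Return the interval covered by every alignment, or None if there is none."""
    if not alignments:
        return None
    start = max(s for s, _ in alignments)
    end = min(e for _, e in alignments)
    return (start, end) if start < end else None


# A 400-base read aligned along its full length to the complete genome, but
# only over its last 150 bases to a fragmentary database entry.
print(shared_region([(0, 400), (250, 400)]))  # (250, 400)
```

Scores for the complete genome and for the fragmentary entry could then both be computed over the
shared interval only, so that a short overlap is not penalized relative to a full-length one.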

The comparison of sample sequences from multiple organisms to one or more closely related completed
genomes could be an effective strategy for discovering genes that distinguish species. As analytical tools
are improved, it should be possible to ask even more sophisticated questions with sample sequence data.
For example, for pathogenic bacteria, one of the most interesting applications would be to compare the
rates of evolution of loci across the genome. Cell surface proteins in pathogenic bacteria are exposed to
the immune system of the host. This can lead to a selective pressure that is greater than that experienced
by most other genes in the genome ([...]). Analytical tools that could align multiple sequences and then
compare the rates of evolution of synonymous and nonsynonymous codons only in the regions shared by
all of the sample sequences would allow detection of candidate loci that may have undergone accelerated
evolution. Where these loci exist, they would be of particular interest for further study as potentially vital
parts of the pathogenic mechanism and as immunologic targets.
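
A first approximation to this comparison is to count codon differences that do and do not change the
encoded amino acid. The Python sketch below assumes two sequences that are already aligned, in frame,
and of equal length, and it uses the standard genetic code. It counts differing codons only and does not
estimate substitution rates per site, which a real analysis would require. All names are illustrative.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
# Standard genetic code, built in TCAG order (third base varies fastest).
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_changes(seq_a: str, seq_b: str) -> tuple[int, int]:
    """Return (synonymous, nonsynonymous) counts of differing codons."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    synonymous = nonsynonymous = 0
    usable = len(seq_a) - len(seq_a) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # skip codons with gaps or ambiguous bases
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# CTG -> CTA keeps leucine; AAA -> GAA changes lysine to glutamate.
print(count_codon_changes("ATGCTGAAA", "ATGCTAGAA"))  # (1, 1)
```

A locus with an unusually high proportion of nonsynonymous changes relative to other loci would be a
candidate for closer study.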

Pathways and structures. Databases of the known metabolic pathways in bacteria and homologs for
genes in these pathways have been assembled (2[...], 24, 4[...], 4[...]). As these databases grow to encompass all
functions of the cell, one potential use of sample sequences is to determine whether metabolic or
transport pathways, signal transduction pathways, or particular physical structures are present in a
bacterium.

How much sequence is needed to determine whether a pathway or structure is present in a bacterium?
Certainly, less than the complete genome is necessary because one need sample only one or a few genes
in a pathway or structure to be confident that the pathway or structure is present. Furthermore, one does
not need to determine the complete sequence of an ORF if it is a close homolog of a known gene in
another bacterium. Perhaps one need sample a stretch of only 50 amino acids (of 9[...]% accurate sequence)
from a gene to be able to assign those that have a close homolog already in the database. As our
knowledge of pathways and structures grows, it should be possible to determine the probable presence
(or probable absence) of an increasing number even with only 50% of the genome represented in a
sample sequence. Furthermore, as more sample sequences are obtained, it should be easier to assign
homologs. Such a sample sequence can be obtained for as little as $25,000 at a sequencing center, and
the price can be expected to continue to fall.
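
The reason a partial sample suffices can be made concrete with a small calculation. The Python sketch
below assumes that each gene of a pathway is sampled independently with probability equal to the
fraction of the genome represented. This is a simplification that ignores gene clustering and the
minimum read length needed to recognize a homolog. The function name is illustrative.

```python
def prob_pathway_detected(genes_in_pathway: int, fraction_sampled: float) -> float:
    """Probability that at least one gene of the pathway is sampled."""
    if genes_in_pathway < 1:
        raise ValueError("a pathway needs at least one gene")
    if not 0 <= fraction_sampled <= 1:
        raise ValueError("fraction_sampled must be between 0 and 1")
    return 1 - (1 - fraction_sampled) ** genes_in_pathway


# With half of the genome represented, larger pathways are rarely missed entirely.
for n_genes in (1, 3, 5, 10):
    print(n_genes, prob_pathway_detected(n_genes, 0.5))
```

Under this assumption, a three-gene pathway is detected with probability 0.875 at 50% representation,
and the chance of missing a pathway entirely halves with each additional gene.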

It is interesting that because genes tend to be clustered in cistrons, a shotgun sample sequence of 50% of
a genome is better for these purposes than a complete sequence of one half of the genome, as well as
being much less expensive. The latter strategy will miss entire cistrons in the half of the genome that has
not been sequenced. In contrast, the shotgun approach will sample a small portion of virtually all
cistrons.
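
This contrast can be illustrated with a simple random-placement model, which is an assumption of this
sketch and not a method used in the study. Read start positions are treated as uniformly random along
the genome (a Poisson model), the read length of 400 bases follows the average stated in the abstract,
and the cistron length of 3 kb is hypothetical.

```python
import math


def read_density(fraction_covered: float, read_len: int) -> float:
    """Reads per base needed so that a random base is covered with the given
    probability, from 1 - exp(-density * read_len) = fraction_covered."""
    if not 0 <= fraction_covered < 1 or read_len <= 0:
        raise ValueError("need 0 <= fraction_covered < 1 and read_len > 0")
    return -math.log(1 - fraction_covered) / read_len


def prob_cistron_sampled(cistron_len: int, read_len: int, fraction_covered: float) -> float:
    """Probability that at least one read overlaps a cistron of the given length."""
    density = read_density(fraction_covered, read_len)
    # A read overlaps the cistron if it starts within cistron_len + read_len bases.
    return 1 - math.exp(-density * (cistron_len + read_len))


shotgun = prob_cistron_sampled(cistron_len=3_000, read_len=400, fraction_covered=0.5)
contiguous_half = 0.5  # a cistron lies in the sequenced half or it does not
print(shotgun, contiguous_half)
```

Under this model, a shotgun sample covering half of the genome touches more than 99% of 3-kb cistrons,
whereas a completed half of the genome contains only about half of them.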

In conclusion, although there are certain limitations of sample sequences, these limitations are more than
counterbalanced by the knowledge that can be gained at very low cost. It is hoped that sample
sequencing will begin in earnest and that the bioinformatics needed to fully exploit sample sequences
will be developed.
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